Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
Authors
Abstract
As a key characteristic in audio-visual speech recognition (AVSR), relating linguistic information observed across visual and audio data has been a challenge, benefiting not only audio/visual speech recognition (ASR/VSR) but also manipulation within/across modalities. In this paper, we present a feature disentanglement-based framework for jointly addressing the above tasks. By advancing cross-modal mutual learning strategies, our model is able to convert visual or audio-based features into modality-agnostic representations. Such derived representations allow one to perform ASR, VSR, and AVSR, as well as to manipulate the output based on desirable subject identity and content information. We conduct extensive experiments on different recognition and synthesis tasks to show that our model performs favorably against state-of-the-art approaches on each individual task, while ours is a unified solution able to tackle the aforementioned tasks.
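To make the core idea concrete, here is a minimal NumPy sketch of mapping audio and visual features into a shared, modality-agnostic space and pulling paired embeddings together with an alignment objective. All names, dimensions, and the linear encoders are illustrative assumptions, not the paper's actual architecture or training loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not from the paper).
D_AUDIO, D_VISUAL, D_SHARED = 16, 32, 8

# Two linear "encoders" projecting each modality into one shared
# (modality-agnostic) linguistic space.
W_a = rng.normal(size=(D_AUDIO, D_SHARED))
W_v = rng.normal(size=(D_VISUAL, D_SHARED))

def encode(x, W):
    """Project a per-modality feature vector into the shared space,
    unit-normalized so embeddings are compared by direction only."""
    z = x @ W
    return z / (np.linalg.norm(z) + 1e-8)

def alignment_loss(z_a, z_v):
    """Mutual-learning-style objective: 1 - cosine similarity between
    the audio and visual embeddings of the same utterance."""
    return 1.0 - float(z_a @ z_v)

# One synthetic audio/visual feature pair for the same utterance.
x_audio = rng.normal(size=D_AUDIO)
x_visual = rng.normal(size=D_VISUAL)

z_a = encode(x_audio, W_a)
z_v = encode(x_visual, W_v)
loss = alignment_loss(z_a, z_v)
```

Minimizing such a loss over paired data would push both encoders toward a common linguistic representation, while any residual modality-specific information (e.g., speaker identity) would be modeled separately for manipulation.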
Similar Resources
Continuous Audio-visual Speech Recognition
We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audiovisual speech recognition applications. An appearance based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal model...
Cross-modal Visual-audio Priming
This study assessed whether presenting visual-only stimuli prior to auditory stimuli facilitates the recognition of spoken words in noise. The results of the study indicate that this type of cross-modal priming does occur. Future directions for research in this domain are presented.
Audio-Visual Speech Recognition
We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding...
CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation
Visual and audio modalities are two symbiotic modalities underlying videos, which contain both common and complementary information. If they can be mined and fused sufficiently, performances of related video tasks can be significantly enhanced. However, due to the environmental interference or sensor fault, sometimes, only one modality exists while the other is abandoned or missing. By recoveri...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2022
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v36i3.20210